White wine story

by Alena

Leonardo da Vinci once said the discovery of a good wine is increasingly better for mankind than the discovery of a new star.

What makes a wine good? Many factors shape the taste and quality of a wine. The ones examined here are acidity, pH level, residual sugar and chlorides.

The data covers variants of the Portuguese “Vinho Verde” wine. It is available at https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv and contains the following variables:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Getting familiar with the data is the first step: we check how much of it there is and what shape it is in.

First, how many observations and features do we have, and what are the data fields?

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
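The dimensions and field names above are the kind of output that `dim()` and `names()` give in R. Below is a minimal sketch, not necessarily the code used for this report. To stay self-contained it reads only the six sample rows printed in the preview further down, so `dim()` reports 6 rows here rather than 4898. The names `csv_text` and `wine_head` are illustrative choices.

```r
# Six sample rows copied from the head() output in this report.
# To load the full file instead, pass the dataset URL given above to read.csv().
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.0,0.27,0.36,20.7,0.045,45,170,1.0010,3.00,0.45,8.8,6
2,6.3,0.30,0.34,1.6,0.049,14,132,0.9940,3.30,0.49,9.5,6
3,8.1,0.28,0.40,6.9,0.050,30,97,0.9951,3.26,0.44,10.1,6
4,7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.40,9.9,6
5,7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.40,9.9,6
6,8.1,0.28,0.40,6.9,0.050,30,97,0.9951,3.26,0.44,10.1,6
"

# Parse the text as CSV; the first line supplies the column names
wine_head <- read.csv(text = csv_text)

dim(wine_head)    # number of rows and number of columns
names(wine_head)  # field names
```

On the full file the same two calls would report all observations and the same 13 fields.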

Then we look at the values (attributes), the structure and, finally, some general statistics:

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The middle half of the wines have fixed acidity between 6.3 and 7.3 g / dm^3 (the first and third quartiles). As for sugar, 75% of the wines in our dataset have less than 9.9 g / dm^3 of sugar remaining after fermentation stops. The average alcohol content in our dataset is about 10.51%.

Analyzing data by looking at its distributions

Since we are given a direct measure of each wine, “quality”, we can explore how it depends on the other variables, that is, what makes a white wine better.

For better understanding, let’s create an ordered factor from quality, which helps us see the relationship between, for example, acidity and quality on a plot.
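As a rough illustration of the ordered factor, the sketch below rebuilds the quality column from the frequency table reported in the prediction section of this report (20 wines of quality 3, 163 of quality 4, and so on) and converts it with `factor(..., ordered = TRUE)`. The variable names are assumptions made for the example, and the row order of the original file is not preserved.

```r
# Quality counts taken from the frequency table shown later in this report
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild the quality column (row order is lost, which does not matter here)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Ordered factor: the levels keep their ranking, so comparisons such as < work
quality_f <- factor(quality, levels = 3:9, ordered = TRUE)

table(quality_f)  # counts per quality level
summary(quality)  # numeric summary of the raw scores

# Comparing the first (lowest) and last (highest) entries of the rebuilt column
quality_f[1] < quality_f[length(quality_f)]
```

Plotting functions treat an ordered factor as a set of ranked groups, which is what makes one box per quality level possible.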

Before we jump into analyzing differences, we need to see the distributions of acidity, quality and pH in our data.

It is encouraging that acidity, pH, density and alcohol are all approximately normally distributed in our data.

A closer look at the summaries of density and alcohol supports this: in both, the mean is close to the median.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Now it is time to compare these variables with quality using box plots, which let us better evaluate the differences between quality levels:

We see some outliers, but they do not affect the overall picture much, so we leave them in. The result is interesting: no quality level stands out, and all of them look much alike. Moreover, we have three types of acidity: fixed, which usually ranges between 6 and 8; volatile, between 0.1 and 0.5; and citric, from 0 to 1.

I want to check citric and volatile acidity at different quality levels:

Here, most of the wines in our dataset fall between 0.25 and 0.6 for volatile acidity and between 0.2 and 0.5 for citric acid.

A closer look at the summaries of volatile.acidity and citric.acid should support our point about normal distributions.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

However, this gives an unexpected result: the maximum of volatile acidity (1.1) lies above the normal range.

Let’s compare quality and pH level with box plots as well:

We see that the average pH values mostly fall between 3.2 and 3.3, which is normal for wines, since they usually lie in the range of 3 to 4. However, some wines go above the 3.6 mark. This matters because a high-pH wine will taste flat and lack freshness, whereas a low-pH wine will taste tart, owing to the higher acid concentration (https://winemakermag.com/547-phiguring-out-ph).

Overall, the relationship of acidity and pH to white wine quality is as expected. We notice that premium-quality wine has fixed acidity between 7 and 8 g / dm^3 and a pH close to 3.3.

We have a lot of factors (variables) that may bear on the quality of white wine, but as stated at the beginning, we will look at only a few of them for now. Acidity and pH have been explored; now it is time to look at sugar and chlorides. Let’s look at the distribution of sugar and its impact on quality:

Sugar in our dataset has a peak close to zero and a maximum above 60 (as we have seen in the summary), so we take the logarithm to see the distribution more clearly. We did not anticipate that the result would have two peaks.
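To show what the log transform does, here is a small sketch in base R. It uses only the first ten residual.sugar values printed in the `str()` output above, so the histograms are far too coarse to show the two peaks; with the full column the same calls apply. The base-10 logarithm is an assumption, since the report does not say which base was used.

```r
# First ten residual.sugar values (g / dm^3), copied from the str() output
sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Log transform (base 10 assumed); valid because every value is positive
log_sugar <- log10(sugar)

# Raw scale: most values crowd the left, large values stretch the axis
hist(sugar, main = "Residual sugar", xlab = "g / dm^3")

# Log scale: the long right tail is compressed, so the shape is easier to read
hist(log_sugar, main = "Residual sugar (log10 scale)", xlab = "log10(g / dm^3)")
```

Because the smallest value in the dataset is 0.6, no zero or negative values need special handling before taking the logarithm.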

The box plots do not explain why sugar is concentrated on the left; however, they suggest that premium-quality wine contains almost no sugar. A closer look at the sugar summary should help us understand the distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The data is strongly skewed: the maximum is about 66, whereas the mean is only 6.4. That explains the concentration of data on the left side.

What about chlorides?

The dataset has some outliers, but the overall shape of the distribution looks normal. The box plot shows pretty much the same average values across quality levels. However, we see an interesting pattern here: white wines of better quality tend to have fewer chlorides.

Looking for correlation between variables

First, we need to exclude variables that show no correlation. We start from pH and look at its correlation with the other variables we are analyzing (acidity, quality, sugar and chlorides). pH has almost no correlation with most of them, but the plot shows a strong negative correlation (colored red) with fixed.acidity. Looking at acidity in turn, we find a correlation with density. We had not planned to look into density until the correlation analysis pointed to it:

In addition to density, we found that alcohol has a clear positive correlation with quality (0.43).
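Pairwise correlations of this kind can be computed with base R's `cor()`. The sketch below is only an illustration: it uses five columns from the six sample rows printed earlier, so its numbers will not match the values reported for the full dataset, although the signs for alcohol against density and for sugar against density agree with the findings described in this report. The quality column is left out because it is constant in those six rows, which leaves its correlation undefined.

```r
# Columns copied from the six rows of the head() output in this report
wine_head <- data.frame(
  fixed.acidity  = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  pH             = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(wine_head)
round(cor_matrix, 2)

# Correlation for a single pair of variables
cor(wine_head$alcohol, wine_head$density)
```

Applied to the full data frame, the same matrix is what a correlation plot colors cell by cell.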

Since we are looking into which factors affect the quality of white wine, we have to explore this relationship more closely. Aside from that, we need to look for patterns or relationships between alcohol and the other variables.

Plotting will help us see them:

The result is that alcohol does not depend on pH, but sugar and chlorides show some patterns related to alcohol. Let’s try to see them in the range of high qualities (7-9):

Here we see that higher-quality white wine contains more than 11 percent alcohol and that sugar is mainly concentrated between 0 and 15. Let’s restrict the sugar level to around 10 and alcohol to more than 11.
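The two filtering steps described here (high qualities first, then a narrower window on alcohol and sugar) could look roughly like the sketch below. The rows are hypothetical values invented for the example, not records from the dataset, and the exact thresholds used for the plots are an assumption based on the text.

```r
# Hypothetical rows, invented for illustration only (not taken from the dataset)
toy <- data.frame(
  quality        = c(5, 6, 7, 8, 9, 7),
  alcohol        = c(9.5, 10.4, 11.2, 12.5, 12.9, 10.8),
  residual.sugar = c(12.0, 5.2, 2.1, 1.6, 2.0, 14.3)
)

# Step 1: keep only the high-quality wines (7 to 9)
high_quality <- subset(toy, quality >= 7)

# Step 2: narrow down to alcohol above 11 and sugar below 10
narrowed <- subset(high_quality, alcohol > 11 & residual.sugar < 10)

nrow(high_quality)  # rows left after step 1
nrow(narrowed)      # rows left after step 2
```

Each subset can then be passed to the plotting call in place of the full data frame.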

The plot now gives a better picture of where the sugar levels lie. Let’s look closer by taking everything below the 5 mark. When we check it, we notice that most of the data is below the 3 mark. Let’s explore it:

Interestingly, in our data, the higher the alcohol level, the less sugar the wine contains. However, this may be an artifact of the data rather than a real pattern. That is a really interesting finding.

Predicting Wine Quality

We now use the information gathered from analyzing patterns in wine quality to predict it:

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

As a baseline, always predicting the most common quality gives an accuracy equal to the largest count in the table divided by the sum of all counts: 2198/(20+163+1457+2198+880+175+5) ≈ 0.4488
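The same arithmetic can be written out in R. This is a small sketch using the counts from the table above; the variable names are illustrative.

```r
# Quality counts copied from the frequency table above
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# The most frequent quality level
majority_class <- names(which.max(quality_counts))

# Accuracy of always predicting that level
baseline_accuracy <- max(quality_counts) / sum(quality_counts)

majority_class
round(baseline_accuracy, 4)
```

Any model worth keeping should beat this figure, since it requires no predictors at all.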

Our analysis showed a clear relationship between wine quality and alcohol percentage, so it makes sense to try predicting the quality of a wine from its alcohol percentage alone:
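The log that follows has the format printed by `multinom()` from the nnet package, so a model of this kind was presumably fitted with something like `multinom(quality ~ alcohol, ...)`. The sketch below shows the general shape of such a fit. It runs on simulated toy data, because the full alcohol column is not reproduced in this report, so its results are unrelated to the ones shown next. The data, seed and variable names are all assumptions made for the example.

```r
library(nnet)  # provides multinom(); a recommended package bundled with most R installations

set.seed(42)
n <- 300

# Simulated toy data (NOT the wine dataset): quality loosely follows alcohol
toy <- data.frame(alcohol = runif(n, min = 8, max = 14))
score <- toy$alcohol + rnorm(n, sd = 1.5)
toy$quality <- cut(score, breaks = c(-Inf, 10, 12, Inf),
                   labels = c("5", "6", "7"))

# Multinomial logistic regression with alcohol as the only predictor
fit <- multinom(quality ~ alcohol, data = toy, trace = FALSE)

# Predicted class for every row, then actual vs predicted counts
pred <- predict(fit, newdata = toy)
conf <- table(actual = toy$quality, pred = pred)
conf

# Share of rows where the prediction matches the actual class
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

With `trace = TRUE` (the default), `multinom()` prints the iteration log with the weights line and the convergence message.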

## # weights:  21 (12 variable)
## initial  value 9531.067910 
## iter  10 value 5944.962833
## iter  20 value 5728.431041
## iter  30 value 5727.493243
## iter  40 value 5727.393430
## final  value 5727.379706 
## converged
##    pred
##        3    4    5    6    7    8    9
##   3    0    0    5   15    0    0    0
##   4    0    0   57  103    3    0    0
##   5    0    0  738  710    9    0    0
##   6    0    0  523 1578   97    0    0
##   7    0    0  100  636  144    0    0
##   8    0    0   18  120   37    0    0
##   9    0    0    0    3    2    0    0
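The table above can be turned into accuracy figures directly. The sketch below re-enters the reported counts as a matrix (rows are actual quality, columns are predicted quality) and computes the overall accuracy, which works out to 2460/4898, about 0.50, a little above the majority-class baseline. The variable names are illustrative.

```r
levels_q <- as.character(3:9)

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
conf <- matrix(
  c(0, 0,   5,   15,   0, 0, 0,
    0, 0,  57,  103,   3, 0, 0,
    0, 0, 738,  710,   9, 0, 0,
    0, 0, 523, 1578,  97, 0, 0,
    0, 0, 100,  636, 144, 0, 0,
    0, 0,  18,  120,  37, 0, 0,
    0, 0,   0,    3,   2, 0, 0),
  nrow = 7, byrow = TRUE,
  dimnames = list(actual = levels_q, pred = levels_q)
)

# Overall accuracy: correctly predicted wines (the diagonal) over all wines
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 4)

# Share of each actual quality level that was predicted correctly
recall <- diag(conf) / rowSums(conf)
round(recall, 2)

# Number of predictions made for each quality level
colSums(conf)
```

The column sums make the main weakness visible: the model never predicts qualities 3, 4, 8 or 9.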

Here we see that most of our predictions fall around the middle quality values, 5 to 7. This suggests that our dataset does not have enough information on high-quality white wine.

Since we were trying to find the factors that affect high quality in white wine, we would need to get more data on such wines and run further analysis on it.

Final plots and Summary

We explored wine quality first to determine where our data lies.

As a result, we see that most of our white wines have quality between 5 and 7. However, we wanted to learn about the factors behind the highest quality. In the analysis section, we determined that high-quality wine has a pH between 3.2 and 3.3, almost no sugar, and an alcohol level of 11 percent or more. What about density, which came up in the correlation analysis? Here we can plot density against alcohol by quality level for the high qualities:

We found that, for all qualities, the higher the alcohol percentage, the lower the density. We already know that the correlation between alcohol and quality is 0.43.

Is the same true for sugar and density at high qualities?

It is not: for sugar and density the relationship runs the other way, with more sugar meaning higher density. We already know that the correlation between sugar and density is 0.84, which is really strong.

Based on this information we can state that:

1. High-quality wine has higher alcohol and pH levels, and lower density and sugar levels.

2. There was not enough data on high-quality white wine: most of the dataset lies between quality levels 5 and 7, and the sugar patterns may be an artifact of the small sample.

3. Our prediction gave us numbers for the most represented quality levels, not the ones we were looking for.

4. Overall, the data distributions are close to normal.

Reflection

Having made these statements, I am convinced that alcohol percentage is one of the most important factors in deciding the quality of white wine. Moreover, residual sugar is tied to it: the more sugar is left after fermentation, the lower the percentage of alcohol in the wine. There were more factors affecting quality, and we gave a pretty good picture of them in the body. However, for future analysis, I would suggest collecting more data on higher-quality wines and looking more closely for patterns with other factors such as acidity and sulphates, because we will definitely find interesting results there.

The main challenge for me was to figure out what I wanted to look at, what goals I should set for analyzing this dataset and, finally, what results I could expect. Looking into the data helped me finalize my goals and made it easier to see patterns. Even after I looked more closely at the correlations between some variables, I still had questionable results that need to be addressed in the future.

Regarding the overall analysis, the small size of the dataset let us run it fairly quickly; with a big dataset, we would have to split or sample it before running the analysis, and then look for particular patterns rather than test for everything.

References

You can find them in my text.